Systems and methods of dynamic branch prediction in a microprocessor

ABSTRACT

A hybrid branch prediction scheme for a multi-stage pipelined microprocessor that combines features of static and dynamic branch prediction to reduce complexity and enhance performance over conventional branch prediction techniques. Prior to microprocessor deployment, a branch prediction table is populated using static branch prediction techniques by executing instructions analogous to those to be executed during microprocessor deployment. The branch prediction table is stored, and then loaded into the branch prediction unit (BPU) during deployment, for example, at the time of microprocessor power on. Dynamic branch prediction is then performed using the pre-loaded data, thereby enabling dynamic branch prediction without a required “warm-up” period. After resolving each branch in the select stage of the microprocessor instruction pipeline, the BPU is updated with the address of the next instruction that resulted from that branch to enhance performance.

CROSS REFERENCE TO RELATED APPLICATION(S)

This application claims priority to provisional application No. 60/572,238, filed May 19, 2004, entitled “Microprocessor Architecture,” hereby incorporated by reference in its entirety.

FIELD OF THE INVENTION

This invention relates generally to microprocessor architecture and more specifically to improved systems and methods for performing branch prediction in a multi-stage pipelined microprocessor.

BACKGROUND OF THE INVENTION

Multistage pipeline microprocessor architecture is known in the art. A typical microprocessor pipeline consists of several stages of instruction handling hardware, wherein each rising pulse of a clock signal propagates instructions one stage further in the pipeline. Although the clock speed dictates the number of clock signals and therefore pipeline propagations per second, the effective operational speed of the processor is dependent partially upon the rate at which instructions and operands are transferred between memory and the processor.

One method of increasing processor performance is branch prediction. Branch prediction uses instruction history to predict whether a branch or non-sequential instruction will be taken. Branch or non-sequential instructions are processor instructions that require a jump to a non-sequential memory address if a condition is satisfied. When an instruction is retrieved or fetched, if the instruction is a conditional branch, the result of the conditional branch, that is, the address of the next instruction to be executed following the conditional branch, is speculatively predicted based on past branch history. This predictive or speculative result is injected into the pipeline by referencing a branch history table. Whether or not the prediction is correct will not be known until a later stage of the pipeline. However, if the prediction is correct, several clock cycles will be saved by not having to go back to get the next non-sequential instruction address.

If the prediction is incorrect, the current pipeline behind the stage in which the prediction is determined to be incorrect must be flushed and the correct branch inserted back in the first stage. This may seem like a severe penalty in the event of an incorrect prediction because it results in the same number of clock cycles as if no branch prediction were used. However, in applications where small loops are repeated many times, such as applications typically implemented with embedded processors, branch prediction has a sufficiently high success rate that the benefits associated with correct predictions outweigh the cost of occasional incorrect predictions, i.e., pipeline flushes. In these types of embedded applications, branch prediction can be accurate over ninety percent of the time. Thus, the risk of predicting an incorrect branch resulting in a pipeline flush is outweighed by the benefit of saved clock cycles.
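The trade-off described above can be made concrete with a small calculation. The sketch below is illustrative only: the function name `expected_stall_cycles` and the five-cycle flush penalty are assumptions, not values taken from this disclosure. It follows the simplification stated above that a misprediction costs the same number of clock cycles as a branch handled without prediction.

```python
def expected_stall_cycles(accuracy: float, flush_penalty: int) -> float:
    """Average stall cycles per branch when a correct prediction costs
    nothing and an incorrect one costs the full flush penalty."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return (1.0 - accuracy) * flush_penalty


# Hypothetical figures: a 5-cycle flush penalty and 90% prediction accuracy.
FLUSH_PENALTY = 5

# With no prediction, every branch is modeled as paying the full penalty.
without_prediction = expected_stall_cycles(0.0, FLUSH_PENALTY)
with_prediction = expected_stall_cycles(0.9, FLUSH_PENALTY)
```

Under these assumed figures the average cost per branch falls to about one tenth of the unpredicted cost, which is the sense in which occasional flushes are outweighed by the saved cycles.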

There are essentially two techniques for implementing branch prediction. The first, dynamic branch prediction, records runtime program flow behavior in order to establish a history that can be used at the front of the pipeline to predict future non-sequential program flow. When a branch instruction comes in, the look up table is referenced for the address of the next instruction which is then predictively injected into the pipeline. Once the look up table is populated with a sufficient amount of data, dynamic branch prediction significantly increases performance. However, this technique is initially ineffective, and can even reduce system performance until a sufficient number of instructions have been processed to fill the branch history tables. Because of the required “warm-up” period for this technique to become effective, runtime behavior of critical code could become unpredictable, making it unacceptable for certain embedded applications. Moreover, as noted above, mistaken branch predictions result in a flush of the entire pipeline, wasting clock cycles and retarding performance.

The other primary branch prediction technique is static branch prediction. Static branch prediction uses profiling techniques to guide the compiler to generate special branch instructions. These special branch instructions typically include hints to guide the processor to perform speculative branch prediction earlier in the pipeline when not all information required for branch resolution is yet available. However, a disadvantage of static branch prediction techniques is that they typically complicate the processor pipeline design because speculative as well as actual branch resolution has to be performed in several pipeline stages. Complication of design translates to increased silicon footprint and higher cost. Static branch prediction techniques can yield accurate results but they cannot cope with variation of run-time conditions. Therefore, static branch prediction also suffers from limitations which reduce its appeal for critical embedded applications.

Thus, it would be desirable to have a branch prediction technique that ameliorates and ideally eliminates one or more of the above-noted deficiencies of conventional branch prediction techniques. However, it should be appreciated that the description herein of various advantages and disadvantages associated with known apparatus, methods, and materials is not intended to limit the scope of the invention to their exclusion. Indeed, various embodiments of the invention may include one or more of the known apparatus, methods, and materials without suffering from their disadvantages.

As background to the techniques discussed herein, the following references are incorporated herein by reference: U.S. Pat. No. 6,862,563 issued Mar. 1, 2005 entitled “Method And Apparatus For Managing The Configuration And Functionality Of A Semiconductor Design” (Hakewill et al.); U.S. Ser. No. 10/423,745 filed Apr. 25, 2003, entitled “Apparatus and Method for Managing Integrated Circuit Designs”; and U.S. Ser. No. 10/651,560 filed Aug. 29, 2003, entitled “Improved Computerized Extension Apparatus and Methods”, all assigned to the assignee of the present invention.

SUMMARY OF THE INVENTION

Various embodiments of the invention may ameliorate or overcome one or more of the shortcomings of conventional branch prediction techniques through a hybrid branch prediction technique that takes advantage of features of both static and dynamic branch prediction.

At least one exemplary embodiment of the invention may provide a method of performing branch prediction in a microprocessor having a multi-stage instruction pipeline. The method of performing branch prediction according to this embodiment comprises building a branch prediction history table of branch prediction data through static branch prediction prior to microprocessor deployment, storing the branch prediction data in a memory in the microprocessor, loading the branch prediction data into a branch prediction unit (BPU) of the microprocessor upon powering on, and performing dynamic branch prediction with the BPU based on the preloaded branch prediction data.

At least one additional exemplary embodiment of the invention may provide a method of enhancing branch prediction performance of a multi-stage pipelined microprocessor employing dynamic branch prediction. The method of enhancing branch prediction performance according to this embodiment comprises performing static branch prediction to build a branch prediction history table of branch prediction data prior to microprocessor deployment, storing the branch prediction history table in a memory in the microprocessor, loading the branch prediction history table into a branch prediction unit (BPU) of the microprocessor, and performing dynamic branch prediction with the BPU based on the preloaded branch prediction data.

Yet an additional exemplary embodiment of the invention may provide an embedded microprocessor architecture. The embedded microprocessor architecture according to this embodiment comprises a multi-stage instruction pipeline, and a BPU adapted to perform dynamic branch prediction, wherein the BPU is preloaded with a branch history table created through static branch prediction, and subsequently updated to contain the actual address of the next instruction that resulted from that branch during dynamic branch prediction.

Other aspects and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating a multistage instruction pipeline of a conventional microprocessor core;

FIG. 2 is a flow chart illustrating the steps of a method for performing dynamic branch prediction based on preloaded static branch prediction data in accordance with at least one exemplary embodiment of the invention; and

FIG. 3 is a block diagram illustrating the flow of data into and out of a branch prediction unit in accordance with at least one exemplary embodiment of the invention.

DETAILED DESCRIPTION OF THE DISCLOSURE

The following description is intended to convey a thorough understanding of the invention by providing specific embodiments and details involving various aspects of a new and useful microprocessor architecture. It is understood, however, that the invention is not limited to these specific embodiments and details, which are exemplary only. It further is understood that one possessing ordinary skill in the art, in light of known systems and methods, would appreciate the use of the invention for its intended purposes and benefits in any number of alternative embodiments, depending upon specific design and other needs.

FIG. 1 illustrates a typical microprocessor core 100 with a multistage instruction pipeline. The first stage of the microprocessor core 100 is the instruction fetch stage (FET) 110. In the instruction fetch stage 110, instructions are retrieved or fetched from instruction RAM 170 based on their N-bit instruction address. During instruction fetches, a copy of the instruction, indexed by its address, will be stored in the instruction cache 112. As a result, future calls to the same instruction may be retrieved from the instruction cache 112, rather than the relatively slower instruction RAM 170.

Another typical component of the fetch stage 110 of a multi-stage pipelined microprocessor is the branch prediction unit (BPU) 114. The branch prediction unit 114 increases processing speed by predicting whether a branch to a non-sequential instruction will be taken based upon past instruction processing history. The BPU 114 contains a branch look-up or prediction table that stores the address of branch instructions and an indication as to whether the branch was taken. Thus, when a branch instruction is fetched, the look-up table is referenced to make a prediction as to the address of the next instruction. As discussed herein, whether or not the prediction is correct will not be known until a later stage of the pipeline. In the example shown in FIG. 1, it will not be known until the sixth stage of the pipeline.

With continued reference to FIG. 1, the next stage of the typical microprocessor core instruction pipeline is the instruction decode stage (DEC) 120, where the actual instruction is decoded into machine language for the processor to interpret. If the instruction involves a branch or a jump, the target address is generated. Next, in stage (REG) 130, any required operands are read from the register file. Then, in stage (EXEC) 140, the particular instruction is executed by the appropriate unit. Typical execute stage units include a floating point unit 143, a multiplier unit 144, an arithmetic unit 145, a shifter 146, a logical unit 147 and an adder unit 148. The result of the execute stage 140 is selected in the select stage (SEL) 150 and finally, this data is written back to the register file by the write back stage (WB) 160. The instruction pipeline increments with each clock cycle.

Referring now to FIG. 2, a flow chart illustrating the steps of a method for performing dynamic branch prediction based on preloaded static branch prediction data in accordance with at least one exemplary embodiment of this invention is shown. As discussed above, dynamic branch prediction is a technique often employed to increase pipeline performance when software instructions lead to a non-sequential program flow. The problem arises because instructions are sequentially fed into the pipeline, but are not executed until later stages of the pipeline. Thus, the decision as to whether a non-sequential program flow (hereinafter also referred to as a branch) is to be taken or not is not resolved until the end of the pipeline, but the related decision of which address to use to fetch the next instruction is required at the front of the pipeline. In the absence of branch prediction, the fetch stage would then have to fetch the next instruction after the branch is resolved, leaving all stages of the pipeline between the resolution stage and the fetch stage unused. This is an undesired hindrance to performance. As a result, the choice as to which instruction to fetch next is made speculatively or predictively based on historical performance. A branch history table is used in the branch prediction unit (BPU) which indexes non-sequential instructions by their addresses in association with the next instruction taken. After resolving a branch in the select stage of the pipeline, the BPU is updated with the address of the next instruction that resulted from that branch.

To alleviate the limitations of both dynamic and static branch prediction techniques, the present invention discloses a hybrid branch prediction technique that combines the benefits of both dynamic and static branch prediction. With continued reference to FIG. 2, the technique begins in step 200 and advances to step 205 where static branch prediction is performed offline before final deployment of the processor, but based on applications which will be executed by the microprocessor after deployment. In various exemplary embodiments, this static branch prediction may be performed using the assistance of a compiler or simulator. For example, if the processor is to be deployed in a particular embedded application, such as an electronic device, the simulator can simulate various instructions for the discrete instruction set to be executed by the processor prior to the processor being deployed. By performing static branch prediction, a table of branch history can be fully populated with the actual addresses of the next instruction after a branch instruction is executed.
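The offline table-building of step 205 might be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the trace record layout, the function name `build_branch_table`, and the policy of keeping the most frequently observed successor for each branch are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Assumed record layout: (instruction_address, is_branch, next_address),
# as might be emitted by a simulator run of the target application.
TraceRecord = Tuple[int, bool, int]


def build_branch_table(trace: Iterable[TraceRecord]) -> Dict[int, int]:
    """Map each branch address to the successor address seen most often
    in the simulated run (an assumed policy for this sketch)."""
    successors: Dict[int, Counter] = defaultdict(Counter)
    for address, is_branch, next_address in trace:
        if is_branch:
            successors[address][next_address] += 1
    return {
        address: counts.most_common(1)[0][0]
        for address, counts in successors.items()
    }


# Hypothetical trace: the branch at 0x108 jumps back to 0x100 twice,
# then falls through to 0x10C when the loop exits.
trace = [
    (0x100, False, 0x104),
    (0x104, False, 0x108),
    (0x108, True, 0x100),
    (0x100, False, 0x104),
    (0x104, False, 0x108),
    (0x108, True, 0x100),
    (0x100, False, 0x104),
    (0x104, False, 0x108),
    (0x108, True, 0x10C),
]
table = build_branch_table(trace)
```

For this trace the table holds a single entry that maps the loop branch at 0x108 to the loop head at 0x100, since that successor was observed more often than the exit path.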

After developing a table of branch prediction data during static branch prediction, operation of the method continues to step 210 where the branch prediction table is stored in memory. In various exemplary embodiments, this step will involve storing the branch prediction table in a non-volatile memory that will be available for future use by the processor. Then, in step 215, when the processor is deployed in the desired embedded application, the static branch prediction data is preloaded into the branch history table in the BPU. In various exemplary embodiments, the branch prediction data is preloaded at power-up of the microprocessor, such as, for example, at power-up of the particular product containing the processor.
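Steps 210 and 215 amount to a store followed by a load. The sketch below models that round trip in software, with an ordinary file standing in for the non-volatile memory. The JSON encoding, the file name, and the function names `store_table` and `preload_table` are assumptions made for illustration; the disclosure does not specify a storage format.

```python
import json
import os
import tempfile
from typing import Dict


def store_table(table: Dict[int, int], path: str) -> None:
    """Step 210 (sketch): write the table to a file that stands in for
    non-volatile memory."""
    with open(path, "w", encoding="utf-8") as handle:
        # JSON object keys must be strings, so addresses are written as hex text.
        json.dump({hex(addr): nxt for addr, nxt in table.items()}, handle)


def preload_table(path: str) -> Dict[int, int]:
    """Step 215 (sketch): read the stored table back at power-up."""
    with open(path, "r", encoding="utf-8") as handle:
        return {int(addr, 16): nxt for addr, nxt in json.load(handle).items()}


# Hypothetical table produced offline.
offline_table = {0x108: 0x100, 0x220: 0x240}

with tempfile.TemporaryDirectory() as nvm_dir:
    image_path = os.path.join(nvm_dir, "bpu_table.json")
    store_table(offline_table, image_path)
    bpu_table = preload_table(image_path)
```

The same pair of functions would also cover saving the current table at power-down, so that the most recent data is what gets preloaded on the next power-up.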

Operation of the method then advances to step 220 where, during ordinary operation, dynamic branch prediction is performed based on the preloaded branch prediction data without requiring a warm-up period and without unstable results. Then, in step 225, after resolving each branch in the select stage of the multistage processor pipeline, the branch prediction table in the BPU is updated with the results to improve accuracy of the prediction information as necessary. Operation of the method terminates in step 230. It should be appreciated that in various exemplary embodiments, each time the processor is powered down, the “current” branch prediction table may be stored in non-volatile memory so that each time the processor is powered up, the most recent branch prediction data is loaded into the BPU.
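The effect of preloading on the warm-up period can be illustrated with a toy comparison of an empty table against a preloaded one. Everything in the sketch is an assumption made for the example: the 4-byte instruction size, the function name `count_mispredictions`, the loop trace, and the rule that a branch with no table entry is predicted to fall through to the next sequential address.

```python
from typing import Dict, List, Tuple

INSTRUCTION_SIZE = 4  # assumed fixed instruction width in bytes


def count_mispredictions(
    branches: List[Tuple[int, int]], table: Dict[int, int]
) -> int:
    """Run steps 220 and 225 over resolved branches given as
    (branch_address, actual_next_address). The table is updated in place."""
    mispredictions = 0
    for branch_address, actual_next in branches:
        # Step 220: predict from the table; with no entry, assume fall-through.
        predicted = table.get(branch_address, branch_address + INSTRUCTION_SIZE)
        if predicted != actual_next:
            mispredictions += 1
            # Step 225: record the resolved address for the next occurrence.
            table[branch_address] = actual_next
    return mispredictions


# Hypothetical loop: the branch at 0x108 is taken back to 0x100 nine times,
# then falls through to 0x10C when the loop exits.
branches = [(0x108, 0x100)] * 9 + [(0x108, 0x10C)]

cold_misses = count_mispredictions(branches, {})
warm_misses = count_mispredictions(branches, {0x108: 0x100})
```

In this toy trace the empty table mispredicts the first iteration and the loop exit, while the preloaded table mispredicts only the loop exit. The first-iteration miss is the warm-up cost that preloading is intended to remove.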

Referring now to FIG. 3, a block diagram illustrating the flow of data into and out of a branch prediction unit 314 in accordance with at least one exemplary embodiment of the invention is shown. In the fetch stage 310 of the instruction pipeline, the BPU 314 maintains a branch prediction look-up table 316 that stores the address of the next instruction indexed by the address of the branch instruction. Thus, when the branch instruction enters the pipeline, the look-up table 316 is referenced by the instruction's address. The address of the next instruction is taken from the table 316 and injected in the pipeline directly following the branch instruction. Therefore, if the branch is taken then the next instruction address is available at the next clock signal. If the branch is not taken, the pipeline must be flushed and the correct instruction address injected back at the fetch stage 310. In the event that a pipeline flush is required, the look-up table 316 is updated with the actual address of the next instruction so that it will be available for the next instance of that branch instruction.
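The data flow of FIG. 3 can be modeled in software as a small object with a look-up path and a resolution path. This is a behavioral sketch only, not a description of the hardware: the class name `BranchPredictionUnit`, its method names, and the `flush_count` counter are hypothetical, and a branch with no table entry is treated as requiring a flush for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class BranchPredictionUnit:
    # Look-up table: branch instruction address -> next instruction address.
    lookup_table: Dict[int, int] = field(default_factory=dict)
    flush_count: int = 0

    def predict(self, branch_address: int) -> Optional[int]:
        """Fetch-stage path: return the stored next-instruction address,
        or None when the table has no entry for this branch."""
        return self.lookup_table.get(branch_address)

    def resolve(
        self, branch_address: int, predicted: Optional[int], actual_next: int
    ) -> bool:
        """Resolution path: compare the injected address with the resolved
        one. On a mismatch, count a flush and update the table.
        Returns True when a pipeline flush is required."""
        if predicted == actual_next:
            return False
        self.flush_count += 1
        self.lookup_table[branch_address] = actual_next
        return True


# Hypothetical use: a preloaded entry predicts correctly once, then the
# branch resolves to a different address and the table is corrected.
bpu = BranchPredictionUnit(lookup_table={0x108: 0x100})
first_flush = bpu.resolve(0x108, bpu.predict(0x108), 0x100)
second_flush = bpu.resolve(0x108, bpu.predict(0x108), 0x10C)
```

Keeping prediction and resolution as separate methods mirrors the separation between the fetch stage, where the address is injected, and the later stage where the branch outcome becomes known.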

While the foregoing description includes many details and specificities, it is to be understood that these have been included for purposes of explanation only. The embodiments of the present invention are not to be limited in scope by the specific embodiments described herein. For example, although many of the embodiments disclosed herein have been described with reference to branch prediction in embedded RISC-type microprocessors, the principles herein are equally applicable to branch prediction in microprocessors in general. Indeed, various modifications of the embodiments of the present inventions, in addition to those described herein, will be apparent to those of ordinary skill in the art from the foregoing description and accompanying drawings. Thus, such modifications are intended to fall within the scope of the following appended claims. Further, although the embodiments of the present inventions have been described herein in the context of a particular implementation in a particular environment for a particular purpose, those of ordinary skill in the art will recognize that its usefulness is not limited thereto and that the embodiments of the present inventions can be beneficially implemented in any number of environments for any number of purposes. Accordingly, the claims set forth below should be construed in view of the full breadth and spirit of the embodiments of the present inventions as disclosed herein.

1. A method of performing branch prediction in a microprocessor having a multistage instruction pipeline, the method comprising: building a branch prediction history table of branch prediction data through static branch prediction prior to microprocessor deployment; storing the branch prediction data in a memory; loading the branch prediction data into a branch prediction unit (BPU) of the microprocessor upon power on; and performing dynamic branch prediction with the BPU based on the preloaded branch prediction data.
2. The method according to claim 1, further comprising updating the branch prediction data in the BPU if, during instruction processing, prediction data changes.
3. The method according to claim 2 wherein updating comprises after resolving a branch in a select stage of the instruction pipeline, updating the BPU with the address of a next instruction that resulted from that branch.
4. The method according to claim 1, wherein building a branch prediction history table comprises simulating instructions that will be executed by the processor during deployment and populating a table of branch history with information indicating whether conditional branches were taken or not.
5. The method according to claim 4, wherein building comprises using at least one of a simulator and a compiler to generate branch history.
6. The method according to claim 1, wherein performing dynamic branch prediction with the branch prediction unit based on the preloaded branch prediction data comprises parsing a branch history table in the BPU that indexes non-sequential instructions by their addresses in association with the next instruction taken.
7. The method according to claim 1, wherein the microprocessor is an embedded microprocessor.
8. The method according to claim 1, further comprising after performing dynamic branch prediction, storing branch history data in the branch prediction unit in a non-volatile memory for preload upon subsequent microprocessor use.
9. In a multistage pipeline microprocessor employing dynamic branch prediction, the method of enhancing branch prediction performance comprising: performing static branch prediction to build a branch prediction history table of branch prediction data prior to microprocessor deployment; storing the branch prediction history table in a memory; loading the branch prediction history table into a branch prediction unit (BPU) of the microprocessor; and performing dynamic branch prediction with the BPU based on the preloaded branch prediction data.
10. The method according to claim 9, wherein static branch prediction is performed prior to microprocessor deployment.
11. The method according to claim 9, wherein loading the branch prediction table is performed subsequent to microprocessor power on.
12. The method according to claim 9, further comprising updating the branch prediction data in the BPU if, during instruction processing, prediction data changes.
13. The method according to claim 12, wherein the microprocessor includes an instruction pipeline having a select stage, and updating comprises after resolving a branch in the select stage, updating the BPU with the address of the next instruction resulting from that branch.
14. The method according to claim 9, wherein building a branch prediction history table comprises simulating instructions that will be executed by the processor during deployment and populating a table of branch history with information indicating whether conditional branches were taken or not.
15. The method according to claim 14, wherein building comprises using at least one of a simulator and a compiler to generate branch history.
16. The method according to claim 9, wherein performing dynamic branch prediction with the branch prediction unit based on the preloaded branch prediction data comprises parsing a branch history table in the BPU that indexes non-sequential instructions by their addresses in association with the next instruction taken.
17. The method according to claim 9, wherein the microprocessor is an embedded microprocessor.
18. The method according to claim 9, further comprising after performing dynamic branch prediction, storing branch history data in the branch prediction unit in a non-volatile memory for preload upon subsequent microprocessor use.
19. An embedded microprocessor comprising: a multistage instruction pipeline; and a BPU adapted to perform dynamic branch prediction, wherein the BPU is preloaded with branch history table created through static branch prediction, and subsequently updated to contain the actual address of the next instruction that resulted from that branch during dynamic branch prediction.
20. The microprocessor according to claim 19, wherein the branch history table contains data generated prior to microprocessor deployment and the BPU is preloaded at power on of the microprocessor.
21. The microprocessor according to claim 19, wherein after resolving a branch in a select stage of the instruction pipeline, the BPU is updated to contain the address of the next instruction that resulted from that branch.
22. The microprocessor according to claim 19, wherein the BPU is preloaded with a branch history table created through static branch prediction during a simulation processing that simulated instructions that will be executed by the microprocessor during deployment and wherein the BPU comprises a branch history table that indexes non-sequential instructions by their addresses in association with the next instruction taken.